model : add PLaMo-2 model #14560
Conversation
This will be necessary to support Jamba (and other recurrent models mixed with Attention). Doesn't compile yet, and finding a slot isn't yet done correctly for recurrent states.
* llama : begin work on support for variable GQA
  This will also be useful for Jamba if we consider the Mamba layers to have 0 KV heads.
* llama : gracefully fail when not finding hybrid slot
* ggml : simplify SSM-related operators
* llama : make recurrent state slot allocation contiguous
* llama : adapt internal uses of batches to llama_ubatch
This reduces overhead when running HellaSwag on thousands of sequences with very small 100k-parameter Mamba models.
This was otherwise a problem when running the HellaSwag benchmark with small batch sizes, causing it to crash.
This removes the need for ggml_ssm_conv!!! But performance seems slightly worse on my system, especially for prompt processing. Maybe ggml_mul_mat isn't optimized for small row sizes? More performance testing is necessary until GGML_OP_SSM_CONV is removed.
* ggml : make ggml_ssm_scan not modify its source tensors
* llama : fix shared recurrent tail cell count for small ubatch sizes
  Otherwise it was impossible to run the 'parallel' example with '-ub 1' with a Mamba or Jamba model.
* ggml : allow GGML_OP_CONCAT to work on non-contiguous tensors
  The implementation already supported it, and this makes Mamba's conv step slightly faster.
This can be changed back later if the name change is wrong. I was renaming the functions anyway to generalize kv-cache-related functions to hybrid and recurrent model architectures. I think llama_past is a better name than llama_cache for a combined kv cache and recurrent state cache, because the states it contains pretty much always come before the newly-added ones for any particular sequence. Also 'llama_past_clear' sounds more obvious in what it does than 'llama_kv_cache_clear'. The future is what the models generate. (For embeddings, the kv cache isn't really used anyway) Still, I'm open to better suggestions.
Co-authored-by: Sigbjørn Skjæret <[email protected]>
…mo2_session to follow the other tokenizer implementations
c805d75 to eea696e
Co-authored-by: Georgi Gerganov <[email protected]>
We can't expect users to do this; I think the better option would be to add this token as EOT at conversion.
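For reference, a sketch of how that could look in convert_hf_to_gguf.py, following the pattern other converters use; the `mark_eot` helper and the choice of `<|plamo:op|>` as the token are assumptions for illustration, not the actual change in this PR:

```python
import gguf
from transformers import AutoTokenizer

# Hypothetical helper sketching "mark a token as EOT at conversion time",
# mirroring how other model classes in convert_hf_to_gguf.py set special tokens.
# The "<|plamo:op|>" token choice is an assumption.
def mark_eot(dir_model, gguf_writer, eot_token="<|plamo:op|>"):
    tokenizer = AutoTokenizer.from_pretrained(dir_model, trust_remote_code=True)
    special_vocab = gguf.SpecialVocab(dir_model, load_merges=False)
    special_vocab._set_special_token("eot", tokenizer.get_vocab()[eot_token])
    special_vocab.add_to_gguf(gguf_writer)
```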
Co-authored-by: Sigbjørn Skjæret <[email protected]>
Vocab needs to be padded or else loading embedded tokens will fail.
Co-authored-by: Sigbjørn Skjæret <[email protected]>
That's right. Thank you for the suggested changes.
The tokenizer is super slow, quite possibly something is wrong, please check it out, but test-tokenizer-0 passes.
Ah, I found that building the Aho tree is performed every time the tokenizer runs.
There's time to fix it now. :)
Thanks, I think it's fixed with 6921534
I will add vocab files to HF for CI in a week or so (so as not to break CI for everyone not in sync with master).
This PR adds support for the PLaMo-2 model in llama.cpp, which was also requested in a related discussion thread: #13874. The model uses a custom-implemented tokenizer, so this PR includes both the model itself (an architecture combining Mamba and Attention, similar to Jamba) and an implementation of the new custom tokenizer.
Based on #7531
How to check whether plamo-2-translate works with this PR: first, retrieve the model itself:
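For example (assuming the checkpoint is published as `pfnet/plamo-2-translate` on Hugging Face; the local directory name is also an assumption):

```sh
# Download the Hugging Face checkpoint into a local directory
huggingface-cli download pfnet/plamo-2-translate --local-dir ./plamo-2-translate
```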
Then, I needed to modify `tokenizer.jsonl` to pad it with some meaningless vocab entries so that the vocabulary size matches what is specified in `config.json`, namely 100032, by using this script:
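A minimal sketch of such a padding script, assuming each line of `tokenizer.jsonl` is one JSON record per token; the `<|padding_N|>` placeholder format and the file path are made up for illustration:

```python
import json

TARGET_VOCAB_SIZE = 100032  # vocab_size from config.json
PATH = "plamo-2-translate/tokenizer.jsonl"

with open(PATH, encoding="utf-8") as f:
    entries = [json.loads(line) for line in f if line.strip()]

# Append dummy entries until the vocabulary reaches the target size,
# reusing the layout of the last real entry (assumed to be a dict or list).
template = entries[-1]
for i in range(len(entries), TARGET_VOCAB_SIZE):
    dummy = dict(template) if isinstance(template, dict) else list(template)
    placeholder = f"<|padding_{i}|>"  # hypothetical placeholder token name
    if isinstance(dummy, dict):
        dummy["token"] = placeholder
    else:
        dummy[0] = placeholder
    entries.append(dummy)

with open(PATH, "w", encoding="utf-8") as f:
    for entry in entries:
        f.write(json.dumps(entry, ensure_ascii=False) + "\n")

print(f"padded vocabulary to {len(entries)} entries")
```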
Next, convert the model into GGUF with the following command:
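For example, using llama.cpp's `convert_hf_to_gguf.py` (the local model directory and output filename are assumptions):

```sh
python convert_hf_to_gguf.py ./plamo-2-translate --outfile plamo-2-translate.gguf
```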
Then build binaries as follows:
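For example, a standard CMake build whose output path matches the `./release/bin/llama-cli` binary used in the final command:

```sh
cmake -B release -DCMAKE_BUILD_TYPE=Release
cmake --build release --config Release -j
```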
and finally, I successfully ran the plamo-2-translate model as follows:
./release/bin/llama-cli -m plamo-2-translate.gguf -p "<|plamo:op|>dataset\ntranslation\n<|plamo:op|>input lang=English\nHello, how are you?\n<|plamo:op|>output\n" -no-cnv --verbose-prompt --no-warmup -sp
intermediate outputs
Output:
Seems to be working correctly!